In this notebook, we analyze the factors that affect NFL teams' success. To answer this question, we explored the 'Elo' dataset and the 'NFL Favorite Teams' dataset. Combining these two datasets allows us to take the 'NFL Favorite Teams' author's unique ratings for each team, across a broad array of categories, and measure those ratings against teams' ability to accumulate wins.
To answer this question, we use the following datasets: nfl_elo (https://github.com/fivethirtyeight/data/tree/master/nfl-elo) and team-picking-categories (https://github.com/fivethirtyeight/data/blob/master/nfl-favorite-team).
The ratings from the 'NFL Favorite Teams' dataset were published in 2016. We assume her ratings were most informed by the five years preceding the publication date, and therefore restricted our game data to games played from 2010-2015. We filtered the dataset down to those relevant seasons, made the column titles more accessible to the audience, and added columns to document the winning team and whether the home team won. There were a few ties in the dataset, three to be exact, so we dropped those rows. We then merged the Elo dataset with the 'NFL Favorite Teams' dataset, dropped the columns we weren't using, made the labeling uniform, and renamed the abbreviations to make the labels more accessible to our audience. We finished by ensuring our datatypes were sensible and added a column to keep track of the total number of wins for each team.
(a) We first created a pairplot to see if there is any relationship between variables - there wasn't anything of note in this visual.
(b) We created some initial catplots to show how the ratings for Affordability, Behavior, and Coaching measured against total wins. From these graphs, Coaching ratings look slightly more correlated with high win totals.
(c) We decided that a treemap would be the most effective visual to provide an executive overview of our data. We organized it so that each team's box is scaled by its number of wins relative to other teams, and a rating scale shows how well each team does in each category.
(d) Finally, we decided to use a radar chart to do a one-to-one comparison of the best and the worst teams' ratings across attributes. The radar chart helps us do an effective comparison to highlight differences between the best and worst teams. It also serves an important purpose in giving our audience insight into our methodology. We felt it was important to include the chart for these reasons.
Based on our exploration of the dataset, we refined our subquestions from the first assignment as follows:
(a) Home Field Advantage - How Much Does it Really Matter?
(b) Show Me the Money - Do Teams With Fans that Spend More Win More?
(c) Do the Crime, Do the Time - Does Player Behavior Impact Team Performance?
(d) Leading from the Front - How Important is Coaching to Team Success?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
elo=pd.read_csv('https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv', parse_dates=['date'],
usecols=['date','team1','team2','score1','score2'])
elo.sample(10)
| date | team1 | team2 | score1 | score2 | |
|---|---|---|---|---|---|
| 12055 | 2003-09-28 | BAL | KC | 10.0 | 17.0 |
| 13896 | 2010-09-19 | WSH | HOU | 27.0 | 30.0 |
| 13553 | 2008-12-14 | MIA | SF | 14.0 | 9.0 |
| 14491 | 2012-10-14 | ATL | OAK | 23.0 | 20.0 |
| 11629 | 2001-11-25 | CLE | CIN | 18.0 | 0.0 |
| 12135 | 2003-11-09 | JAX | IND | 28.0 | 23.0 |
| 11959 | 2002-12-21 | MIN | MIA | 20.0 | 17.0 |
| 8110 | 1987-09-20 | NYG | DAL | 14.0 | 16.0 |
| 16735 | 2020-12-13 | CAR | DEN | 27.0 | 32.0 |
| 16957 | 2021-11-14 | DEN | PHI | 13.0 | 30.0 |
To simplify our analysis of whether factors like coaching ability, fan expenditure, and player behavior affect the number of wins teams accumulate, we use the metrics given by the 'NFL Favorite Team' dataset. The author's analysis was published in 2016, so we assume her ratings were most informed by the five years preceding the publication date, and restricted our game data to games played from 2010-2015.
In the following code, we restrict our dataset to games that occurred within the given timeframe, from 2010 to 2015.
opening_day = '2010-09-09'
closing_day = '2015-02-01'
condition1 = elo['date'] >= opening_day
condition2 = elo['date'] <= closing_day
elo = elo[condition1 & condition2]
elo
| date | team1 | team2 | score1 | score2 | |
|---|---|---|---|---|---|
| 13871 | 2010-09-09 | NO | MIN | 14.0 | 9.0 |
| 13872 | 2010-09-12 | NYG | CAR | 31.0 | 18.0 |
| 13873 | 2010-09-12 | PHI | GB | 20.0 | 27.0 |
| 13874 | 2010-09-12 | JAX | DEN | 24.0 | 17.0 |
| 13875 | 2010-09-12 | WSH | DAL | 13.0 | 7.0 |
| ... | ... | ... | ... | ... | ... |
| 15201 | 2015-01-11 | GB | DAL | 26.0 | 21.0 |
| 15202 | 2015-01-11 | DEN | IND | 13.0 | 24.0 |
| 15203 | 2015-01-18 | SEA | GB | 28.0 | 22.0 |
| 15204 | 2015-01-18 | NE | IND | 45.0 | 7.0 |
| 15205 | 2015-02-01 | NE | SEA | 28.0 | 24.0 |
1335 rows × 5 columns
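The same date filter can be written more compactly with `Series.between`, which is inclusive on both endpoints, matching the `>=` / `<=` conditions above. A minimal sketch on toy dates (the column name mirrors the Elo data; the dates themselves are made up):

```python
import pandas as pd

# Toy frame with one date before, one inside, and one after the window
toy = pd.DataFrame({'date': pd.to_datetime(['2009-01-01', '2012-10-14', '2016-02-07'])})

# between() is inclusive on both ends, so the boundary games are kept
window = toy[toy['date'].between('2010-09-09', '2015-02-01')]
print(len(window))  # → 1
```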
We adjust the column names to make them more interpretable to our audience.
elo = elo.rename(columns={'date':'Game_Date', 'team1':'Home_Team', 'team2':'Away_Team',
'score1':'Home_Score', 'score2':'Away_Score'})
We reset the index of our dataframe, which now contains only the desired NFL season.
elo.reset_index(drop=True, inplace=True)
Now that we have the refined dataset, we create a column that shows the winning team. We will use this column later to document total wins.
conditions = [(elo['Home_Score'] > elo['Away_Score']),
(elo['Away_Score'] > elo['Home_Score'])]
values = [elo['Home_Team'], elo['Away_Team']]
elo['Winning_Team'] = np.select(conditions, values, default='Tie')
elo.sample(5)
| Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | Winning_Team | |
|---|---|---|---|---|---|---|
| 1235 | 2014-11-23 | ATL | CLE | 24.0 | 26.0 | CLE |
| 158 | 2010-11-21 | NYJ | HOU | 30.0 | 27.0 | NYJ |
| 896 | 2013-10-20 | PIT | BAL | 19.0 | 16.0 | PIT |
| 1215 | 2014-11-13 | MIA | BUF | 22.0 | 9.0 | MIA |
| 583 | 2012-09-30 | TB | WSH | 22.0 | 24.0 | WSH |
There is a chance that the scores were equal and the game resulted in a tie. We see from the following code that there are 3 games in this dataset that ended in ties.
elo.Winning_Team.unique()
array(['NO', 'NYG', 'GB', 'JAX', 'WSH', 'MIA', 'NE', 'TEN', 'PIT', 'HOU',
'SEA', 'TB', 'CHI', 'ARI', 'BAL', 'KC', 'ATL', 'IND', 'CIN', 'DEN',
'PHI', 'OAK', 'NYJ', 'LAC', 'LAR', 'MIN', 'DAL', 'CLE', 'DET',
'SF', 'CAR', 'BUF', 'Tie'], dtype=object)
elo.groupby('Winning_Team').count().tail()
| Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | |
|---|---|---|---|---|---|
| Winning_Team | |||||
| SF | 55 | 55 | 55 | 55 | 55 |
| TB | 27 | 27 | 27 | 27 | 27 |
| TEN | 30 | 30 | 30 | 30 | 30 |
| Tie | 3 | 3 | 3 | 3 | 3 |
| WSH | 28 | 28 | 28 | 28 | 28 |
elo.drop(elo.index[elo['Winning_Team'] == 'Tie'], inplace = True)
The ties have been removed from the dataset.
elo.groupby('Winning_Team').count().tail()
| Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | |
|---|---|---|---|---|---|
| Winning_Team | |||||
| SEA | 57 | 57 | 57 | 57 | 57 |
| SF | 55 | 55 | 55 | 55 | 55 |
| TB | 27 | 27 | 27 | 27 | 27 |
| TEN | 30 | 30 | 30 | 30 | 30 |
| WSH | 28 | 28 | 28 | 28 | 28 |
Now we add a column to document whether the home team won or lost the game, which we will use to answer the question of whether home field advantage really matters.
conditions = [(elo['Winning_Team'] == elo['Home_Team'])]
values = [True]
elo['Home_Victory'] = np.select(conditions, values)
elo.sample(5)
| Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | Winning_Team | Home_Victory | |
|---|---|---|---|---|---|---|---|
| 953 | 2013-11-17 | PHI | WSH | 24.0 | 16.0 | PHI | 1 |
| 326 | 2011-10-02 | LAR | WSH | 10.0 | 17.0 | WSH | 0 |
| 1015 | 2013-12-15 | IND | HOU | 25.0 | 3.0 | IND | 1 |
| 72 | 2010-10-10 | WSH | GB | 16.0 | 13.0 | WSH | 1 |
| 868 | 2013-10-06 | DAL | DEN | 48.0 | 51.0 | DEN | 0 |
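Because there is only one condition here, the same flag can be computed without `np.select` by comparing the two columns directly. A sketch on toy rows (team names made up):

```python
import pandas as pd

toy = pd.DataFrame({'Home_Team': ['AAA', 'BBB'], 'Winning_Team': ['AAA', 'CCC']})

# True where the winner was the home team, cast to 0/1 to match np.select's output
toy['Home_Victory'] = (toy['Winning_Team'] == toy['Home_Team']).astype(int)
print(toy['Home_Victory'].tolist())  # → [1, 0]
```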
The Elo dataset is now ready for analysis. We can move forward by reading in our second dataset - 'NFL Favorite Teams.'
tpc = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-favorite-team/team-picking-categories.csv')
tpc.sample(5)
| TEAM | BMK | UNI | CCH | STX | SMK | AFF | SLP | NYP | FRL | BNG | TRD | BWG | FUT | PLA | OWN | BEH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Houston Texans | 87 | 32 | 55 | 71 | 13 | 48 | 52 | 23 | 81 | 35 | 0 | 39 | 61 | 74 | 68 | 95 |
| 22 | Oakland Raiders | 10 | 97 | 32 | 3 | 90 | 65 | 0 | 0 | 42 | 0 | 65 | 94 | 35 | 19 | 19 | 69 |
| 4 | Buffalo Bills | 3 | 81 | 58 | 32 | 97 | 97 | 55 | 81 | 55 | 61 | 48 | 74 | 68 | 58 | 77 | 69 |
| 29 | New York Jets | 71 | 58 | 23 | 23 | 29 | 3 | 29 | 100 | 16 | 10 | 19 | 68 | 52 | 13 | 23 | 16 |
| 2 | Kansas City Chiefs | 26 | 71 | 84 | 84 | 74 | 71 | 97 | 35 | 71 | 48 | 45 | 77 | 94 | 71 | 87 | 44 |
The team names don't match up with our abbreviated team names in the Elo dataset. Because we are mostly interested in assessing the qualities that contribute to wins, and not necessarily which teams have had the best records, we will use the abbreviated team names to save space in the dataframe.
tpc=tpc.replace({'Kansas City Chiefs':'KC','Pittsburgh Steelers':'PIT','New England Patriots':'NE',
'Buffalo Bills':'BUF','Carolina Panthers':'CAR','Seattle Seahawks':'SEA','Indianapolis Colts':'IND',
'Arizona Cardinals':'ARI','Baltimore Ravens':'BAL','Houston Texans':'HOU','New Orleans Saints':'NO',
'Philadelphia Eagles':'PHI','Denver Broncos':'DEN','Detroit Lions':'DET','Minnesota Vikings':'MIN',
'New York Giants':'NYG','Atlanta Falcons':'ATL','Dallas Cowboys':'DAL','Jacksonville Jaguars':'JAX',
'Miami Dolphins':'MIA','Cincinnati Bengals':'CIN','Oakland Raiders':'OAK','Tampa Bay Buccaneers':'TB',
'Los Angeles Rams':'LAR','Chicago Bears':'CHI','Cleveland Browns':'CLE','San Diego Chargers':'LAC',
'San Francisco 49ers':'SF','New York Jets':'NYJ','Washington Redskins':'WSH','Tennessee Titans':'TEN',
'Green Bay Packers':'GB'})
tpc.sample(5)
| TEAM | BMK | UNI | CCH | STX | SMK | AFF | SLP | NYP | FRL | BNG | TRD | BWG | FUT | PLA | OWN | BEH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | IND | 19 | 74 | 68 | 97 | 81 | 77 | 100 | 61 | 94 | 94 | 71 | 23 | 26 | 87 | 48 | 16 |
| 2 | KC | 26 | 71 | 84 | 84 | 74 | 71 | 97 | 35 | 71 | 48 | 45 | 77 | 94 | 71 | 87 | 44 |
| 31 | TEN | 16 | 55 | 10 | 13 | 84 | 32 | 94 | 48 | 19 | 16 | 32 | 71 | 0 | 6 | 13 | 29 |
| 26 | CLE | 23 | 39 | 16 | 26 | 77 | 42 | 71 | 74 | 10 | 32 | 77 | 87 | 3 | 10 | 10 | 16 |
| 10 | HOU | 87 | 32 | 55 | 71 | 13 | 48 | 52 | 23 | 81 | 35 | 0 | 39 | 61 | 74 | 68 | 95 |
Although the 'NFL Favorite Teams' dataset has many interesting ratings, we are going to concentrate on the four that help us answer our questions: Coaching (CCH), Affordability (AFF), Bang for Buck (BNG), and Behavior (BEH).
We will drop all other columns from the dataset for clarity.
tpc.drop(columns = ['FRL', 'OWN', 'PLA', 'FUT', 'BWG', 'TRD', 'NYP', 'SLP', 'SMK', 'STX', 'UNI', 'BMK'], inplace=True)
tpc.sample(5)
| TEAM | CCH | AFF | BNG | BEH | |
|---|---|---|---|---|---|
| 0 | GB | 81 | 94 | 81 | 69 |
| 19 | JAX | 61 | 100 | 3 | 29 |
| 6 | SEA | 95 | 23 | 100 | 29 |
| 5 | CAR | 52 | 84 | 90 | 44 |
| 30 | WSH | 3 | 0 | 6 | 3 |
We will also relabel the 'TEAM' column to 'Winning_Team' so we can more easily merge it with our Elo dataset.
tpc.rename(columns = {'BNG':'Bang_for_Buck','BEH':'Behavior','AFF':'Affordability','CCH':'Coaching','TEAM':'Winning_Team'},inplace=True)
tpc.sample(5)
| Winning_Team | Coaching | Affordability | Bang_for_Buck | Behavior | |
|---|---|---|---|---|---|
| 23 | TB | 29 | 68 | 13 | 44 |
| 18 | DAL | 42 | 16 | 77 | 29 |
| 11 | NO | 87 | 87 | 45 | 84 |
| 28 | SF | 0 | 10 | 65 | 95 |
| 22 | OAK | 32 | 65 | 0 | 69 |
We're now ready to merge our two dataframes together.
df = pd.merge(elo, tpc, on = ['Winning_Team'])
Now we can check to make sure our datatypes are consistent.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1332 entries, 0 to 1331
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Game_Date      1332 non-null   datetime64[ns]
 1   Home_Team      1332 non-null   object
 2   Away_Team      1332 non-null   object
 3   Home_Score     1332 non-null   float64
 4   Away_Score     1332 non-null   float64
 5   Winning_Team   1332 non-null   object
 6   Home_Victory   1332 non-null   int64
 7   Coaching       1332 non-null   int64
 8   Affordability  1332 non-null   int64
 9   Bang_for_Buck  1332 non-null   int64
 10  Behavior       1332 non-null   int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(3)
memory usage: 124.9+ KB
Because scores in the NFL can only be integers, we will change the Home_Score and Away_Score to the 'int' dtype to reflect this.
df.Home_Score = df.Home_Score.astype('int64')
df.Away_Score = df.Away_Score.astype('int64')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1332 entries, 0 to 1331
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Game_Date      1332 non-null   datetime64[ns]
 1   Home_Team      1332 non-null   object
 2   Away_Team      1332 non-null   object
 3   Home_Score     1332 non-null   int64
 4   Away_Score     1332 non-null   int64
 5   Winning_Team   1332 non-null   object
 6   Home_Victory   1332 non-null   int64
 7   Coaching       1332 non-null   int64
 8   Affordability  1332 non-null   int64
 9   Bang_for_Buck  1332 non-null   int64
 10  Behavior       1332 non-null   int64
dtypes: datetime64[ns](1), int64(7), object(3)
memory usage: 124.9+ KB
df.sample(5)
| Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | Winning_Team | Home_Victory | Coaching | Affordability | Bang_for_Buck | Behavior | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 590 | 2013-12-08 | ARI | LAR | 30 | 30 | ARI | 1 | 95 | 90 | 68 | 56 |
| 203 | 2013-11-03 | WSH | LAC | 30 | 30 | WSH | 1 | 3 | 0 | 6 | 3 |
| 400 | 2010-10-17 | HOU | KC | 35 | 35 | HOU | 1 | 55 | 48 | 35 | 95 |
| 506 | 2011-09-25 | TB | ATL | 16 | 16 | TB | 1 | 29 | 68 | 13 | 44 |
| 1206 | 2014-11-09 | DET | MIA | 20 | 20 | DET | 1 | 35 | 39 | 58 | 84 |
We're almost finished compiling our dataset - the last factor that we're interested in documenting in its own column is the cumulative number of wins per team. We will use pd.NamedAgg to generate a column that shows this information.
wins = elo.groupby('Winning_Team').agg(number_wins = pd.NamedAgg(column='Winning_Team', aggfunc='count'))
wins
| number_wins | |
|---|---|
| Winning_Team | |
| ARI | 39 |
| ATL | 47 |
| BAL | 59 |
| BUF | 31 |
| CAR | 35 |
| CHI | 43 |
| CIN | 44 |
| CLE | 25 |
| DAL | 43 |
| DEN | 53 |
| DET | 38 |
| GB | 62 |
| HOU | 41 |
| IND | 48 |
| JAX | 22 |
| KC | 39 |
| LAC | 43 |
| LAR | 29 |
| MIA | 36 |
| MIN | 31 |
| NE | 70 |
| NO | 51 |
| NYG | 45 |
| NYJ | 39 |
| OAK | 27 |
| PHI | 42 |
| PIT | 53 |
| SEA | 57 |
| SF | 55 |
| TB | 27 |
| TEN | 30 |
| WSH | 28 |
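The same per-team win counts can also be obtained with `value_counts`, which additionally sorts by count in descending order. A sketch on toy winners (made-up teams and counts):

```python
import pandas as pd

toy_wins = pd.Series(['NE', 'GB', 'NE', 'NE', 'GB'], name='Winning_Team')

# One row per team, ordered by count descending
counts = toy_wins.value_counts()
print(counts.to_dict())  # → {'NE': 3, 'GB': 2}
```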
We can merge this with our existing dataframe to create a complete dataset with all the information needed to answer our questions.
df = pd.merge(wins, df, on=['Winning_Team'])
df
| Winning_Team | number_wins | Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | Home_Victory | Coaching | Affordability | Bang_for_Buck | Behavior | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | 39 | 2010-09-12 | LAR | ARI | 13 | 13 | 0 | 95 | 90 | 68 | 56 |
| 1 | ARI | 39 | 2010-09-26 | ARI | OAK | 24 | 24 | 1 | 95 | 90 | 68 | 56 |
| 2 | ARI | 39 | 2010-10-10 | ARI | NO | 30 | 30 | 1 | 95 | 90 | 68 | 56 |
| 3 | ARI | 39 | 2010-12-12 | ARI | DEN | 43 | 43 | 1 | 95 | 90 | 68 | 56 |
| 4 | ARI | 39 | 2010-12-25 | ARI | DAL | 27 | 27 | 1 | 95 | 90 | 68 | 56 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1327 | WSH | 28 | 2013-11-03 | WSH | LAC | 30 | 30 | 1 | 3 | 0 | 6 | 3 |
| 1328 | WSH | 28 | 2014-09-14 | WSH | JAX | 41 | 41 | 1 | 3 | 0 | 6 | 3 |
| 1329 | WSH | 28 | 2014-10-19 | WSH | TEN | 19 | 19 | 1 | 3 | 0 | 6 | 3 |
| 1330 | WSH | 28 | 2014-10-27 | DAL | WSH | 17 | 17 | 0 | 3 | 0 | 6 | 3 |
| 1331 | WSH | 28 | 2014-12-20 | WSH | PHI | 27 | 27 | 1 | 3 | 0 | 6 | 3 |
1332 rows × 12 columns
We run a quick describe on our dataframe to get a sense of our data. It doesn't tell us much, but we can see from the mean of our 'Home_Victory' column that the home team wins 57.9% of the time, which is important to answering our question about the significance of home field advantage!
df.describe()
| number_wins | Home_Score | Away_Score | Home_Victory | Coaching | Affordability | Bang_for_Buck | Behavior | |
|---|---|---|---|---|---|---|---|---|
| count | 1332.000000 | 1332.000000 | 1332.000000 | 1332.000000 | 1332.000000 | 1332.000000 | 1332.000000 | 1332.000000 |
| mean | 44.839339 | 24.069069 | 24.069069 | 0.579580 | 53.895646 | 49.379880 | 55.304805 | 51.024775 |
| std | 11.750653 | 10.555554 | 10.555554 | 0.493812 | 30.746125 | 29.840646 | 29.105165 | 30.324444 |
| min | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 38.000000 | 17.000000 | 17.000000 | 0.000000 | 26.000000 | 23.000000 | 29.000000 | 29.000000 |
| 50% | 43.000000 | 24.000000 | 24.000000 | 1.000000 | 55.000000 | 48.000000 | 58.000000 | 56.000000 |
| 75% | 53.000000 | 31.000000 | 31.000000 | 1.000000 | 81.000000 | 77.000000 | 81.000000 | 77.000000 |
| max | 70.000000 | 62.000000 | 62.000000 | 1.000000 | 100.000000 | 100.000000 | 100.000000 | 95.000000 |
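To hedge that 57.9% figure, a rough normal-approximation z-score for the home win rate against a 50/50 null can be computed from the counts in the table above (n and the mean of Home_Victory). This is a back-of-the-envelope check that ignores dependence between games, not a formal test:

```python
import math

n = 1332          # games after dropping ties (count row of describe)
p_hat = 0.579580  # observed home win rate (mean of Home_Victory)

# Standard error of a proportion under the null hypothesis p = 0.5
se = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / se
print(round(z, 2))  # roughly 5.8, well beyond conventional significance thresholds
```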
We used a pairplot to provide a rough overview of our dataset and get a feel for whether there were any correlations between variables. There were no obvious correlations that we could identify.
sns.pairplot(df[['Coaching', 'Behavior', 'Bang_for_Buck', 'Affordability', 'number_wins']])
plt.show()
We produced initial catplots for each of our advantages of interest to get a feel for how much impact they have.
sns.catplot(data=df, x='Coaching', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()
sns.catplot(data=df, x='Affordability', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()
sns.catplot(data=df, x='Behavior', y='number_wins', kind='bar', palette='RdBu', aspect=30/10)
plt.show()
We decided to use a treemap, which provides a holistic overview of our data and shows our audience how the 32 teams in the NFL compare in terms of win records and in terms of their ratings across our 4 categories of interest.
We created a new dataframe that included only the information we needed for our visual. We dropped unnecessary columns as well as our 'Bang for Buck' attribute. Based on our previous visualizations, we felt Affordability was a better attribute to concentrate on for fan expenditures.
tree_df = df.drop(columns=['Game_Date', 'Home_Team', 'Away_Team', 'Home_Score', 'Away_Score', 'Bang_for_Buck'])
We created a column documenting the percentage of each team's wins that came at home, which standardizes the home-win figures across teams, because teams accumulate different numbers of wins.
tree_df['%_Home_Wins'] = tree_df.groupby('Winning_Team')['Home_Victory'].transform('sum') * 100 / df['number_wins']
Now that we have that information, we no longer need the binary column documenting whether a team won or lost each game, so we dropped that column to make our dataframe more conducive to the treemap.
tree_df.drop(columns='Home_Victory', inplace=True)
tree_df.shape
(1332, 6)
At this point, our dataframe has an entry for every game from the 2010-2015 seasons, so we condense it by dropping duplicate rows based on the winning team.
tree_df = tree_df.drop_duplicates(subset = 'Winning_Team', keep='first').reset_index(drop=True)
tree_df.shape
(32, 6)
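Since the per-team rating columns are constant within each team, the same collapse could also be done with `groupby(...).first()`, which doesn't depend on row order. A sketch on toy rows (made-up values):

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['NE', 'NE', 'GB'], 'Coaching': [100, 100, 81]})

# One row per team; first() picks the (identical) per-team values
collapsed = toy.groupby('Team', as_index=False).first()
print(collapsed.shape)  # → (2, 2)
```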
We renamed our columns so that our attributes are, once again, abbreviations, as these will be easier to read in the sub-boxes of the treemap.
tree_df.rename(columns = {'Winning_Team':'Team','%_Home_Wins':'HW', 'Coaching':'CCH', 'Affordability':'AFF', 'Behavior':'BEH'},inplace=True)
tree_df.columns
Index(['Team', 'number_wins', 'CCH', 'AFF', 'BEH', 'HW'], dtype='object')
We melted the dataframe so that each of our 4 categories (Home Wins, Coaching, Affordability, and Behavior) and their respective rating values become their own rows under the column headers of 'Category' and 'Rating.'
tree_df = tree_df.melt(id_vars = ['Team', 'number_wins'], var_name='Category', value_name = 'Rating')
tree_df
| Team | number_wins | Category | Rating | |
|---|---|---|---|---|
| 0 | ARI | 39 | CCH | 95.000000 |
| 1 | ATL | 47 | CCH | 48.000000 |
| 2 | BAL | 59 | CCH | 90.000000 |
| 3 | BUF | 31 | CCH | 58.000000 |
| 4 | CAR | 35 | CCH | 52.000000 |
| ... | ... | ... | ... | ... |
| 123 | SEA | 57 | HW | 63.157895 |
| 124 | SF | 55 | HW | 54.545455 |
| 125 | TB | 27 | HW | 48.148148 |
| 126 | TEN | 30 | HW | 53.333333 |
| 127 | WSH | 28 | HW | 50.000000 |
128 rows × 4 columns
Our dataframe is now ready to convert into a treemap, which gives us an excellent overview of our data set. It scales the box for each team based on their number of wins compared to other teams, and it uses a rating scale to show how well each team does in each category.
# Code adapted from https://plotly.com/python/treemaps/
import plotly.express as px
fig = px.treemap(tree_df, path=['Team', 'Category'], values='number_wins',
color='Rating', hover_data=['Team'],
color_continuous_scale=['#543005', '#8c510a', '#bf812d', '#d8b365', '#d9f0d3', '#5aae61', '#31a354', '#1b7837', '#00441b'],
color_continuous_midpoint=np.average(tree_df['Rating'], weights=tree_df['number_wins']),
width = 1000, height = 350)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
We can see from our treemap, which orders teams by wins from left to right, that the top 9 teams generally have very strong coaching ratings, which suggests that coaching could be an important factor in team success. Interestingly, not all good teams are unaffordable, as the Green Bay Packers show as the team with the 2nd most wins. Bad teams tend to have more bad behavior, as we can see from the 8 worst performing teams. Finally, no team appears to have a poor score in terms of home victories, which suggests this could very well be a universal advantage for good and bad teams alike.
We played with an alternative version of the treemap, which mapped the data first by category and then by team, but we felt the first version was more interpretable and helped us tell our story more effectively.
# Code adapted from https://plotly.com/python/treemaps/
import plotly.express as px
fig = px.treemap(tree_df, path=['Category', 'Team'], values='number_wins',
color='Rating', hover_data=['Team'],
color_continuous_scale=['#543005', '#8c510a', '#bf812d', '#d8b365', '#d9f0d3', '#5aae61', '#31a354', '#1b7837', '#00441b'],
color_continuous_midpoint=np.average(tree_df['Rating'], weights=tree_df['number_wins']))
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
We sorted the data by number of wins to see who the best and worst teams in the league were, so we could do a one-to-one comparison of their ratings across attributes.
df.sort_values(by=['number_wins'])
| Winning_Team | number_wins | Game_Date | Home_Team | Away_Team | Home_Score | Away_Score | Home_Victory | Coaching | Affordability | Bang_for_Buck | Behavior | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 620 | JAX | 22 | 2012-01-01 | JAX | IND | 19 | 19 | 1 | 61 | 100 | 3 | 29 |
| 626 | JAX | 22 | 2013-12-05 | JAX | HOU | 27 | 27 | 1 | 61 | 100 | 3 | 29 |
| 625 | JAX | 22 | 2013-12-01 | CLE | JAX | 28 | 28 | 0 | 61 | 100 | 3 | 29 |
| 624 | JAX | 22 | 2013-11-24 | HOU | JAX | 6 | 6 | 0 | 61 | 100 | 3 | 29 |
| 623 | JAX | 22 | 2013-11-10 | TEN | JAX | 27 | 27 | 0 | 61 | 100 | 3 | 29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 812 | NE | 70 | 2010-10-24 | LAC | NE | 20 | 20 | 0 | 100 | 26 | 84 | 77 |
| 813 | NE | 70 | 2010-10-31 | NE | MIN | 28 | 28 | 1 | 100 | 26 | 84 | 77 |
| 814 | NE | 70 | 2010-11-14 | PIT | NE | 26 | 26 | 0 | 100 | 26 | 84 | 77 |
| 808 | NE | 70 | 2010-09-12 | NE | CIN | 38 | 38 | 1 | 100 | 26 | 84 | 77 |
| 818 | NE | 70 | 2010-12-12 | CHI | NE | 7 | 7 | 0 | 100 | 26 | 84 | 77 |
1332 rows × 12 columns
The radar chart helps us do an effective comparison to highlight differences between the best and worst teams, but it also serves an important purpose in giving our audience insight into our methodology. We felt it was important to include the chart for these reasons.
First we pull the category ratings for the top team (NE).
NE = tree_df.loc[tree_df['Team'] == 'NE'][['Category', 'Rating']]
NE_values = NE['Rating'].to_list()
NE_values
[100.0, 26.0, 77.0, 62.857142857142854]
Then we pull the category ratings for the worst team (JAX).
JAX = tree_df.loc[tree_df['Team'] == 'JAX'][['Category', 'Rating']]
JAX_values = JAX['Rating'].to_list()
JAX_values
[61.0, 100.0, 29.0, 63.63636363636363]
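The two rating lists above inherit their order from the melt (CCH, AFF, BEH, HW), which is what the radar chart's category labels rely on. Pulling values through an explicit index makes that order robust if the melt ever changes. A sketch on toy ratings (values made up):

```python
import pandas as pd

toy = pd.DataFrame({'Category': ['CCH', 'AFF', 'BEH', 'HW'],
                    'Rating': [100.0, 26.0, 77.0, 62.9]})

# Reindex by an explicit category list instead of relying on row order
order = ['CCH', 'AFF', 'BEH', 'HW']
values = toy.set_index('Category').loc[order, 'Rating'].tolist()
print(values)  # → [100.0, 26.0, 77.0, 62.9]
```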
Now we can use this data to generate our radar chart.
#https://plotly.com/python/radar-chart/
import plotly.graph_objects as go
categories = ['Coaching','Affordability', 'Behavior', 'Home Wins']
fig = go.Figure()
fig.add_trace(go.Scatterpolar(
r=NE_values,
theta=categories,
fill='toself',
fillcolor='#00441b',
line_color='#00441b',
opacity=0.7,
name='Patriots'
))
fig.add_trace(go.Scatterpolar(
r=JAX_values,
theta=categories,
fill='toself',
fillcolor='#543005',
line_color='#543005',
opacity=0.6,
name='Jaguars'
))
fig.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, 100]
)),
showlegend=True
)
fig.update_layout(title='Direct Team Comparison')
fig.show()
We can see from this direct comparison that the Patriots have very strong coaching and player behavior scores, but they are an unaffordable team for fans. Conversely, the Jaguars have relatively low coaching and behavior scores, but a very high score in the affordability category. Interestingly, both teams are roughly equal in terms of home victories, which supports our general assertion that all teams (good and bad) seem to benefit from that advantage.